60 research outputs found

    Active Learning with Multiple Views

    Active learners alleviate the burden of labeling large amounts of data by detecting and asking the user to label only the most informative examples in the domain. We focus here on active learning for multi-view domains, in which there are several disjoint subsets of features (views), each of which is sufficient to learn the target concept. In this paper we make several contributions. First, we introduce Co-Testing, which is the first approach to multi-view active learning. Second, we extend the multi-view learning framework by also exploiting weak views, which are adequate only for learning a concept that is more general/specific than the target concept. Finally, we empirically show that Co-Testing outperforms existing active learners on a variety of real-world domains such as wrapper induction, Web page classification, advertisement removal, and discourse tree parsing.
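The abstract above does not spell out how Co-Testing picks its queries. The sketch below illustrates one plausible reading, under the assumption that informative examples are those on which hypotheses learned in different views disagree. The names `contention_points`, `h1`, `h2`, and `pool`, and the toy thresholds, are hypothetical and not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

# Each unlabeled example is a pair: (features in view 1, features in view 2).
Example = Tuple[Sequence[float], Sequence[float]]
Classifier = Callable[[Sequence[float]], int]


def contention_points(unlabeled: List[Example],
                      h1: Classifier,
                      h2: Classifier) -> List[int]:
    """Return indices of examples on which the two view hypotheses disagree.

    Assumption: disagreement between views marks an example as worth
    asking the user to label.
    """
    return [i for i, (v1, v2) in enumerate(unlabeled) if h1(v1) != h2(v2)]


# Toy, hand-written hypotheses standing in for classifiers trained per view.
def h1(v1: Sequence[float]) -> int:
    return int(v1[0] > 0.5)


def h2(v2: Sequence[float]) -> int:
    return int(v2[0] > 0.3)


pool: List[Example] = [
    ([0.9], [0.8]),   # both views predict 1
    ([0.4], [0.35]),  # view 1 predicts 0, view 2 predicts 1
    ([0.1], [0.1]),   # both views predict 0
    ([0.6], [0.2]),   # view 1 predicts 1, view 2 predicts 0
]

queries = contention_points(pool, h1, h2)
print(queries)
```

In a full active learning loop, one of the returned examples would be labeled by the user, added to the training set, and both view hypotheses would be retrained before the next query.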

    Self-supervised automated wrapper generation for weblog data extraction

    Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, but remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts. It adopts a probabilistic approach for deriving a set of rules and automating the process of wrapper generation. An evaluation of the model is conducted on a dataset of 2,393 posts, and the results (92% accuracy) show that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
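The matching step described above can be illustrated with a minimal sketch. It assumes a known feed value (here a post title) is compared with the text of every element in the corresponding post page, and the element path that matches most often across posts becomes the extraction rule. The path notation, the `derive_rule` function, and the frequency-based support score are illustrative assumptions, not the paper's actual model.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class PathText(HTMLParser):
    """Collect (element path, text) pairs from a well-formed HTML fragment.

    Simplification: void elements such as <br> are not handled specially.
    """

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []
        self.pairs: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.pairs.append(("/".join(self.stack), text))


def derive_rule(feed_values: List[str],
                pages: List[str]) -> Tuple[Optional[str], float]:
    """Vote for the element path whose text equals the feed value."""
    votes: Counter = Counter()
    for value, page in zip(feed_values, pages):
        parser = PathText()
        parser.feed(page)
        for path, text in parser.pairs:
            if text == value:
                votes[path] += 1
    if not votes:
        return None, 0.0
    path, count = votes.most_common(1)[0]
    # Fraction of posts supporting the rule (a stand-in for a probability).
    return path, count / len(pages)


pages = [
    '<div class="post"><h1 class="title">First post</h1><p>Hello</p></div>',
    '<div class="post"><h1 class="title">Second post</h1><p>Second post</p></div>',
]
titles = ["First post", "Second post"]  # values that would come from the web feed

rule, support = derive_rule(titles, pages)
print(rule, support)
```

Voting across several posts is what lets the sketch discard the coincidental match in the second page, where the body paragraph happens to repeat the title.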

    Learning multiple views with orthogonal denoising autoencoders

    Multi-view learning techniques are necessary when data is described by multiple distinct feature sets because single-view learning algorithms tend to overfit on these high-dimensional data. Prior successful approaches followed either consensus or complementary principles. Recent work has focused on learning both the shared and private latent spaces of views in order to take advantage of both principles. However, these methods cannot ensure that the latent spaces are strictly independent merely by encouraging orthogonality in their objective functions. Also, little work has explored representation learning techniques for multi-view learning. In this paper, we use the denoising autoencoder to learn shared and private latent spaces, with orthogonal constraints that disconnect every private latent space from the remaining views. Instead of computationally expensive optimization, we adapt the backpropagation algorithm to train our model.
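One way to picture "disconnecting every private latent space from the remaining views" is as a binary connectivity mask over the encoder weights. The sketch below is an assumption about how such a constraint could be represented, not the paper's implementation; the function `connectivity_mask` and the toy dimensions are hypothetical.

```python
from typing import List


def connectivity_mask(view_dims: List[int],
                      shared_dim: int,
                      private_dims: List[int]) -> List[List[int]]:
    """Build a (latent units x input features) 0/1 mask.

    Rows are ordered as shared units first, then the private units of each
    view. Columns are the input features of all views, concatenated.
    Shared units may see every view; private units of view v see only view v.
    """
    total_in = sum(view_dims)

    # Column range [lo, hi) occupied by each view in the concatenated input.
    ranges = []
    offset = 0
    for d in view_dims:
        ranges.append((offset, offset + d))
        offset += d

    mask = [[1] * total_in for _ in range(shared_dim)]
    for (lo, hi), p in zip(ranges, private_dims):
        for _ in range(p):
            mask.append([1 if lo <= j < hi else 0 for j in range(total_in)])
    return mask


# Two views with 2 and 3 features, 1 shared unit, 1 and 2 private units.
mask = connectivity_mask(view_dims=[2, 3], shared_dim=1, private_dims=[1, 2])
for row in mask:
    print(row)
```

Multiplying the weight matrix elementwise by such a mask after each gradient step would keep the masked connections at zero, which is one simple way to combine a hard constraint with ordinary backpropagation.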

    Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach


    Wrapper Maintenance for Web-Data Extraction Based on Pages Features


    Bottom-Up Learning of Logic Programs for Information Extraction from Hypertext Documents

    We present a bottom-up inductive logic programming learning algorithm (BFOIL) for synthesizing logic programs for multi-slot information extraction from hypertext documents. BFOIL learns from positive examples only. Furthermore, we introduce a logical and relational representation for hypertext documents (TDOM). We briefly discuss several BFOIL refinements and show very promising results of our system LIPX in comparison to state-of-the-art IE systems.

    Selective Sampling Based on Dynamic Certainty Propagation for Image Retrieval


    An Investigation on Genetic Algorithms for Generic STRIPS Planning
